Preparations

Load the necessary libraries

library(car)       #for regression diagnostics
library(broom)     #for tidy output
library(ggfortify) #for model diagnostics
library(DHARMa)    #for residual diagnostics
library(performance) #for residuals diagnostics
library(see)         #for plotting residuals
library(sjPlot)    #for outputs
library(knitr)     #for kable
library(effects)   #for partial effects plots
library(ggeffects) #for partial effects plots
library(emmeans)   #for estimating marginal means
library(modelr)    #for auxillary modelling functions
library(tidyverse) #for data wrangling

Scenario

Here is an example from Fowler, Cohen, and Jarvis (1998). An agriculturalist was interested in the effects of fertilizer load on the yield of grass. Grass seed was sown uniformly over an area and different quantities of commercial fertilizer were applied to each of ten 1 m2 randomly located plots. Two months later the grass from each plot was harvested, dried and weighed. The data are in the file fertilizer.csv in the data folder.

FERTILIZER YIELD
25 84
50 80
75 90
100 154
125 148
... ...
FERTILIZER: Mass of fertilizer (g.m-2) - Predictor variable
YIELD: Yield of grass (g.m-2) - Response variable

The aim of the analysis is to investigate the relationship between fertilizer concentration and grass yield.

Read in the data

fert = read_csv('../public/data/fertilizer.csv', trim_ws=TRUE)
## 
## ── Column specification ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## cols(
##   FERTILIZER = col_double(),
##   YIELD = col_double()
## )
glimpse(fert)
## Rows: 10
## Columns: 2
## $ FERTILIZER <dbl> 25, 50, 75, 100, 125, 150, 175, 200, 225, 250
## $ YIELD      <dbl> 84, 80, 90, 154, 148, 169, 206, 244, 212, 248
## Explore the first 6 rows of the data
head(fert)
str(fert)
## spec_tbl_df [10 × 2] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ FERTILIZER: num [1:10] 25 50 75 100 125 150 175 200 225 250
##  $ YIELD     : num [1:10] 84 80 90 154 148 169 206 244 212 248
##  - attr(*, "spec")=
##   .. cols(
##   ..   FERTILIZER = col_double(),
##   ..   YIELD = col_double()
##   .. )

Exploratory data analysis

Model formula: \[ y_i \sim{} \mathcal{N}(\mu_i, \sigma^2)\\ \mu_i = \beta_0 + \beta_1 x_i \]

where \(y_i\) represents the \(i\) observed values, \(\beta_0\) and \(\beta_1\) represent the intercept and slope respectively, and \(\sigma^2\) represents the estimated variance.

ggplot(fert, aes(y = YIELD, x = FERTILIZER)) +
  geom_point() +
  geom_smooth()

ggplot(fert, aes(y = YIELD, x = FERTILIZER)) +
  geom_point() +
  geom_smooth(method = "lm")

ggplot(fert, aes(y = YIELD)) +
  geom_boxplot(aes(x = 1))

Fit the model

fert.lm <- lm(YIELD~1+FERTILIZER, data = fert)
fert.lm <- lm(YIELD~FERTILIZER, data = fert)

Model validation

Model outputs

Model investigation / hypothesis testing

Predictions

For simple models prediction is essentially taking the model formula complete with parameter (coefficient) estimates and solving for new values of the predictor. To explore this, we will use the fitted model to predict Yield for a Fertilizer concentration of 110.

predict

## establish a data set that defines the new data to predict against
newdata = data.frame(FERTILIZER = 110)
## using the predict function
predict(fert.lm, newdata = newdata)
##        1 
## 141.1867
## include confidence intervals
predict(fert.lm,  newdata = newdata,  interval = "confidence")
##        fit      lwr      upr
## 1 141.1867 126.3506 156.0227

Conclusions:

  • we expect (predict) a Yield of 141.19 associated with a Fertilizer level of 110.
  • confidence intervals represent the interval in which we are 95% confident the true Mean value of the population falls.
  • prediction intervals (not show, but described for comparison) represent the interval in which we are 95% confident a single observation drawn from the population mean will fall.

manual calculations

## establish a data set that defines the new data to predict against
newdata = data.frame(FERTILIZER = 110)
## Establish an appropriate model matrix
Xmat = model.matrix(~FERTILIZER, data = newdata)
## Perform matrix multiplication
(pred <- coef(fert.lm) %*% t(Xmat))
##             1
## [1,] 141.1867
## Calculate the standard error
se <- sqrt(diag(Xmat %*% vcov(fert.lm) %*% t(Xmat)))
## Calculate the confidence intervals
as.numeric(pred) + outer(se, qt(df = df.residual(fert.lm),  c(0.025, 0.975)))
##       [,1]     [,2]
## 1 126.3506 156.0227

emmeans

The emmeans package has a set of routines for estimating marginal means from fitted models.

## using emmeans
newdata = list(FERTILIZER = 110)
emmeans(fert.lm, ~FERTILIZER, at = newdata)
##  FERTILIZER emmean   SE df lower.CL upper.CL
##         110    141 6.43  8      126      156
## 
## Confidence level used: 0.95

Note, the above outputs are rounded only for the purpose of constructing the printed table on screen. If we were to capture the output of the emmeans() function, the values would not be rounded.

Additional analyses

Summary figures

References

Fowler, J., L. Cohen, and P. Jarvis. 1998. Practical Statistics for Field Biology. England: John Wiley & Sons.